Key Phrase Based - Graph Representation for Contextual Similarity Between Documents

Authors

  • Ruturaj M. Dhekane
  • Sumeet Khurana
Abstract

Finding similarity between documents that share no keywords has received little attention so far. Here we develop a graph-based representation for finding contextual similarity between documents whose keywords are completely disjoint. For this purpose, a bigram-based key phrase approach is designed. Different algorithms for pairwise similarity were studied and adapted to suit our application. A classification technique using a key phrase graph was designed to classify a document's key phrases into commonly occurring, contextually similar keywords. We give results and demonstrate the capability of our system to find contextual similarity between two docu-


Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...

Phrase-based Document Similarity Based on an Index Graph Model

Document clustering techniques mostly rely on single term analysis of the document data set, such as the Vector Space Model. To better capture the structure of documents, the underlying data model should be able to represent the phrases in the document as well as single terms. We present a novel data model, the Document Index Graph, which indexes web documents based on phrases, rather than sing...

A Graphical Framework For Contextual Search And Name Disambiguation In Email

Similarity measures for text have historically been an important tool for solving information retrieval problems. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a lazy graph walk. We provide a detailed instantiation of this framework for email data, where content, social networks and a timeline are integrated in a struct...

Random Indexing for Searching Large RDF Graphs

Querying large RDF spaces with traditional query languages such as SPARQL is challenging as it requires a familiarity with the structure of the RDF graph and the names (URIs) of its classes, properties and relevant individuals. In this paper, we propose a complementary approach based on Vector Space Models (VSM), more concretely Random Indexing (RI) [1] for building a semantic index for a large...

A Graphical Framework for Contextual Search and Disambiguation in Email

Similarity measures for text have historically been an important tool for solving information retrieval problems. In many interesting settings, however, documents are often closely connected to other documents, as well as other non-textual objects in structure-rich data. In this paper we consider extended similarity metrics for documents and other objects embedded in graphs, facilitated via a l...

Publication date: 2009